73 research outputs found

    Finite-state description, developing mental awareness

    In this article, we approach finite-state description practices that must be instilled in the developer. Thoughts are presented accompanied by reference to concrete experiences with different languages and their description. We contend that finite-state description of languages leads to development in the describer-developer. This presupposes regular interaction with developers of upstream and downstream technologies. And as more languages are described, the developer learns what to choose as a starting point, hopefully with the help of a researcher, research documentation or native speaker well versed in the workings of the language. We maintain that finite-state work should serve more than one purpose or audience, and that, as linguists, we should be raising the bar by applying the knowledge of research to description, so that our understanding of the linguistic phenomena can be attested by others or proven false. We are providing a methodology for repeatable experimentation and rule making. We see that each language provides something unique, while sharing some recognizable features with other languages. We stress the necessity to avoid generating characters from epsilons and offer examples where it is possible to write rules that reduce characters to epsilons instead. We also stress the need to describe the predictable infinite set of all native phenomena, whereas the unknown and random qualities introduced through language contact cannot form a foundation for our descriptions. Finally, we call for a playful approach to phenomena in a language, because that might bring us closer to how a child would learn the language – through repetition, mistakes and self-correction. Peer reviewed

    The Livonian-Estonian-Latvian Dictionary as a threshold to the era of language technological applications

    This article outlines the multiple uses of electronic source materials from the Livonian-Estonian-Latvian Dictionary of 2012 in a Kone Foundation-funded project for developing finite-state morphological parsers. It provides an introduction to the project, the language-independent Giellatekno infrastructure at Tromsø, Norway, and the materials utilized in the electronic manuscript of the dictionary. The introduction is followed by an extensive description of what has been developed on the Giellatekno infrastructure with explicit indications of where parallel projects might be initiated.

    Summary (translated from Estonian). Jack Rueter: The Livonian-Estonian-Latvian Dictionary as a threshold to the era of language technological applications. The article gives an overview of the versatile use of the electronic source material “Livonian-Estonian-Latvian Dictionary 2012” in a Kone Foundation-funded project for developing morphological analyzers. The introductory part of the article presents the project and the language-independent Giellatekno infrastructure created in Tromsø; the materials used in the electronic manuscript of the dictionary are also introduced. It then describes the possibilities created by Giellatekno software development and gives examples of how similar projects can be initiated.

    Keywords: Livonian, Uralic languages, Kone language programme, open source, language-independent infrastructure, HFST, Giellatekno, morphological analyzer, spell checker, “Morphology-savvy” web dictionary, computer-assisted language learning

    Summary (translated from Livonian). Jack Rueter: The Livonian-Estonian-Latvian Dictionary as a threshold to the era of language technological applications. The article gives an overview of the versatile use of the “Livonian-Estonian-Latvian Dictionary” for developing morphological analyzers. The project is funded by the Kone Foundation. The article explains the independent Giellatekno infrastructure, which was created in Tromsø, Norway, and the materials used in the electronic manuscript of the dictionary. It likewise describes the possibilities offered by the software created at Giellatekno and shows how similar projects can be started.

    Linguistic Distance between Erzya and Moksha. Dependent Morphology

    The purpose of this article is to outline morphological facts about the two literary languages Erzya and Moksha, which can be used for estimating the distinctive character of these individual language forms. Whereas earlier morphological evaluations of the linguistic distance between Erzya and Moksha have placed them in the area of 90% cohesion, this one does not. This study evaluates the languages on the basis of non-ambiguity, parallel sets of ambiguity and divergent ambiguity. Non-ambiguity is found in combinatory function to morphological formant alignment, e.g. молян go+V+Ind+Prs+ScSg1. Parallel sets of ambiguity are found in combinatory-function set to morphological formant alignment where both languages share the same sets of ambiguous readings, e.g. саизь vs сявозь take+V+Ind+ScPl3+OcSg3, ScPl3+OcPl3. Divergent ambiguity is found in forms with non-symmetric alignments of combinatory functions, e.g. саинек take+V+Ind+Prt1+ScPl1, +Prt1+ScPl1+OcSg3, +Prt1+ScPl1+OcPl3 vs сявоме take+V+Ind+Prt1+ScPl1, сявоськ take+V+Ind+Prt1+ScPl1+OcSg3, +Prt1+ScPl1+OcPl3. This morphological evaluation will establish the preparatory work in syntactic disambiguation necessary for facilitating Erzya↔Moksha machine translation, whereas machine translation will enhance the usage of mutual language resources. Results show that the Erzya and Moksha languages, in the absence of loan words from the 20th century, share less than 50% of their vocabularies, 63% of their regular nominal declensions and 48% of their regular finite conjugations. Peer reviewed

    Synchronized Mediawiki based analyzer dictionary development

    Open-source analyzer dictionary development is being implemented for Skolt Sami, Ingrian, Moksha-Mordvin, etc. in the Helsinki CSC infrastructure, home of the Finnish Kielipankki 'Language Bank' and Termipankki 'Term Bank'. The proximity of minority-language corpora in need of annotation and the multiple usage of controlled wikimedia-type dictionaries make CSC an attractive site for synchronized transducer dictionary development. The open-source FST development of Uralic and other minority languages at Giellatekno-Divvun in Tromsø demonstrates a vast potential for reuse of FSTs, only augmented by open-source work in OmorFi, Apertium and Universal Dependency. The initial idea is to allow synchronized editing of Giellatekno XML and CSC wiki structures via GitHub. In addition to allowing for simple lexc LEMMA:STEM CONTINUATION_LEXICON "TRANSLATION" ; line exports, the parallel dictionaries will provide for documentation of derivation, morpho-syntactic information on valency and government, semantics and etymology. Peer reviewed
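The lexc line export mentioned in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual exporter: the `Entry` record, the `escape_lexc` and `to_lexc_line` helpers, the sample entry and the continuation lexicon name `N_KALA` are all hypothetical, and the set of characters escaped with `%` is a partial, assumed list.

```python
from dataclasses import dataclass

# Assumption: a partial set of characters that need a '%' escape in lexc
# entries (for example ':' separates lemma from stem, '0' denotes epsilon).
LEXC_SPECIAL = set("%:;!<>#0 ")


@dataclass(frozen=True)
class Entry:
    """Hypothetical dictionary record holding the four exported fields."""
    lemma: str
    stem: str
    continuation: str  # name of the continuation lexicon
    translation: str   # gloss placed in the quoted field


def escape_lexc(text: str) -> str:
    """Prefix every character in LEXC_SPECIAL with '%'."""
    return "".join("%" + ch if ch in LEXC_SPECIAL else ch for ch in text)


def to_lexc_line(entry: Entry) -> str:
    """Render one LEMMA:STEM CONTINUATION_LEXICON "TRANSLATION" ; line."""
    # Assumption: double quotes inside the gloss are swapped for single
    # quotes so that the quoted field stays well-formed.
    gloss = entry.translation.replace('"', "'")
    return (
        f"{escape_lexc(entry.lemma)}:{escape_lexc(entry.stem)} "
        f'{entry.continuation} "{gloss}" ;'
    )


if __name__ == "__main__":
    # Hypothetical sample entry, not taken from any of the dictionaries.
    print(to_lexc_line(Entry("kala", "kal", "N_KALA", "fish")))
```

Keeping the record separate from the rendering function means the same entry could also feed the XML or wiki side of a synchronized dictionary, which is the kind of reuse the abstract describes.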

    The first complete scientific grammar of Skolt Saami in English

    Timothy Feist: A Grammar of Skolt Saami. Mémoires de la Société Finno-Ougrienne 273. Finno-Ugrian Society. Helsinki 2015. 414 p.

    Skolt Sami, the makings of a pluricentric language, where does it stand?

    This paper provides a brief description of Skolt Sami and how it might be construed as a pluricentric language. Historical factors are identified that might contribute to a pluricentric identity: geographic location and political history, shortages of language documentation, and the establishment of a normative body for the development of a standard language. Skolt Sami is assessed in the context of the Sami languages and is put forward as one member of a closely related yet distinct language group. The issue then becomes one of facilitating diversity even for under-documented languages, and we describe opportunities in language technology that have been utilized to this end. Finally, brief insight is given into other Uralic languages with regard to pluricentric character and the possibilities for language users to facilitate the maintenance of their individual language needs. Peer reviewed

    Finding Sami Cognates with a Character-Based NMT Approach

    Peer reviewed

    FST Morphology for the Endangered Skolt Sami Language

    Peer reviewed